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Abstract: The Morse-Smale complex of a function / decomposes the sam¬ 
ple space into cells where / is increasing or decreasing. When applied to 
nonparametric density estimation and regression, it provides a way to rep¬ 
resent, visualize, and compare multivariate functions. In this paper, we 
present some statistical results on estimating Morse-Smale complexes. This 
allows us to derive new results for two existing methods: mode clustering 
and Morse-Smale regression. We also develop two new methods based on 
the Morse-Smale complex: a visualization technique for multivariate func¬ 
tions and a two-sample, multivariate hypothesis test. 
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1. Introduction 

Let / be a smooth, real-valued function defined on a compact set K S In 
this paper, / will be a regression function or a density function. The Morse- 
Smale complex of / is a partition of IK based on the gradient flow induced by /. 
Roughly speaking, the complex consists of sets, called crystals or cells, comprised 
of regions where / is increasing or decreasing. Figure 1 shows the Morse-Smale 
complex for a two-dimensional function. The cells are the intersections of the 
basins of attractions (under the gradient flow) of the function’s maxima and 
minima. The function / is piecewise monotonic over cells with respect to some 
directions. In a sense, the Morse-Smale complex provides a generalization of 
isotonic regression. 

Because the Morse-Smale complex represents a multivariate function in terms 
of regions on which the function has simple behavior, the Morse-Smale complex 
has useful applications in statistics, including in clustering, regression, testing, 
and visualization. For instance, when / is a density function, the basins of at¬ 
traction of /’s modes are the (population) clusters for density-mode clustering 
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(c) d-cell 


(d) Morse-Smale complex 


Fig 1. An example of a Morse-Smale complex. The green dots are local minima; the blue 
dots are local modes; the violet dots are saddle points. Panels (a) and (b) give examples of 
descending d-manifolds (blue region) and an ascending 0-manifold (green region). Panel (c) 
shows the corresponding d-cell (yellow region). Panel (d) is shows all d-cells. 


(also known as mean shift clustering (Fukunaga and Hostetler, 1975; Chacon 
et ah, 2015)), each of which is a union of cells from the Morse-Smale complex. 
Similarly, when / is a regression function, the cells of the Morse-Smale complex 
give regions on which / has simple behavior. Fitting / over the Morse-Smale 
cells provides a generalization of nonparametric, isotone regression; Gerber et al. 
(2013) proposes such a method. The Morse-Smale representation of a multivari¬ 
ate function / is a useful tool for visualizing /’s structure, as shown by Gerber 
et al. (2010). In addition, suppose we want to compare two multi-dimensional 
datasets X = {Xi,... ,Xn) and Y = (Yi,...,Ym)- We start by forming the 
Morse-Smale complex of p— q where p is density estimate from X and q is den¬ 
sity estimate from Y. Figure 2 shows a visualization built from this complex. 
The circles represent cells of the Morse-Smale complex. Attached to each cell is 
a pie-chart showing what fraction of the cell has p significantly larger than q. 
This visualization is a multi-dimensional extension of the method proposed for 
two or three dimensions in Duong (2013). 

For all these applications, the Morse-Smale complex needs to be estimated. 
To the best of our knowledge, no theory has been developed for this estimation 
problem, prior to this paper. We have three goals in this paper: to show that 
many existing problems can be cast in terms of the Morse-Smale complex, to 
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Fig 2. Graft-versus-Host Disease (GvHD) dataset (Brinkman et al., 2007). This is a d = 4 
dimensional dataset. We estimate the density difference based on the kernel density estimator 
and find regions where the two densities are significantly different. Then we visualize the 
density difference using the Morse-Smale complex. Each green circle denotes a d-cell, which 
is a partition for the support K. The size of circle is proportional to the size of cell. If two 
cells are neighborhors, we add a line connecting them; the thickness of the line denotes the 
amount of boundary they share. The pie charts show the ratio of the regions within each cell 
where the two densities are significantly different from each other. See Section 3.4 for more 
details. 


develop some new statistical methods based on the Morse-Smale complex, and 
to develop the statistical theory for estimating the complex. 


Main results. The main results of this paper are: 

1. Consistency of the Morse-Smale Complex. We prove the stability of the 
Morse-Smale complex (Theorem 1) in the following sense: if B and B are 
boundaries of the descending d-manifolds (or ascending 0-manifolds) of p 
and p (defined in Section 2) , then 

Haus(B,i3) = 0(||Vp-Vp|U). 

2. Risk Bound for Mode clustering (mean-shift clustering; section 3.1): We 
bound the risk of mode clustering in Theorem 2. 

3. Morse-Smale regression (section 3.2): In Theorems 4 and 5, we bound the 
risk of Morse-Smale regression, a multivariate regression method proposed 
in Gerber et al. (2010); Gerber and Potter (2011); Gerber et al. (2013) 
that synthesizes nonparametric regression and linear regression. 

4. Morse-Smale signatures (section 3.3): We introduce a new visualization 
method for densities and regression functions. 

5. Morse-Smale two-sample testing (section 3.)): We develop a new method 
for multivariate two-sample testing that can have good power. 

Related work. The mathematical foundations for the Morse-Smale complex 
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Fig 3. A one dimensional example. The blue dots are local modes and the green dots are 
local minima. Left panel: the basins of attraction for two local modes are colored by brown 
and orange. Middle panel: the basin of attraction (negative gradient) for the local minima are 
colored by red, purple and violet. Right panel: The intersection of the basins, which are called 
d-cells. 


are from Morse theory (Morse, 1925, 1930; Milnor, 1963). Morse theory has 
many applications including computer vision (Paris and Durand, 2007), com¬ 
putational geometry (Cohen-Steiner et ah, 2007) and topological data analysis 
(Chazal et ah, 2014). 

Previous work on the stability of the Morse-Smale complex can be found in 
Chen et al. (2016) and Chazal et al. (2014) but they only consider critical points 
rather than the whole Morse-Smale complex. Arias-Castro et al. (2016) prove 
pointwise convergence for the gradient ascent curves but this is not sufficient 
for proving the stability of the complex because the convergence of complexes 
requires convergence of multiple curves and the constants in the convergence 
rate derived from Arias-Castro et al. (2016) vary from points to points and some 
constants diverge when we are getting closer to the boundaries of complexes. 
Thus, we cannot obtain a uniform convergence of gradient ascent curves directly 
based on their results. Morse-Smale regression and visualization were proposed 
in Gerber et al. (2010); Gerber and Potter (2011); Gerber et al. (2013). 

The R code (Algorithm 1, 2, and 3) used in this paper can be found at 
https://github.com/yenchic/Morse_Smale. 

2. Morse Theory 

To motivate formal definitions, we start with the simple, one-dimensional ex¬ 
ample depicted in Figure 3. The left panel shows the sets associated with each 
local maximum (i.e. the basins of attraction of the maxima). The middle panel 
shows the sets associated with each local minimum. The right panel show the 
intersections of these basins, which gives the Morse-Smale complex defined by 
the function. Each interval in the complex, called a cell, is a region where the 
function is increasing or decreasing. 

Now we give a formal definition. Let / : K C i—t K be a function with 

bounded third derivatives that is defined on a compact set K. Let g{x) = V/(a;) 
and H{x) = VV/(a;) be the gradient and Hessian matrix of /, respectively, and 
let Xj{x) be the jth largest eigenvalue of H{x). Define C = {x G K : g{x) = 0} 
to be the set of all /’s critical points, which we call the critical set. Using the 
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signs of the eigenvalues of the Hessian, the critical set C can be partitioned into 
d+\ distinct subsets Co, • • ■ ,C^, where 

Cfe = {a; e K : 5 (a:) = 0, Afe(a;) > 0, Afc+i(a;) < 0}, fc = 1, • • • , d - 1. (1) 

We dehne Cq, Cd to be the sets of all local maxima and minima (corresponding 
to all eigenvalues being negative and positive respectively). The set Ck is called 
fc—th order critical set. 

A smooth function / is called a Morse function (Morse, 1925; Milnor, 1963) 
if its Hessian matrix is non-degenerate at each critical point. That is, |Aj(x)| > 
0,Va; G C for all j. In what follows we assume / is a Morse function (actually, 
later we will assume further that / is a Morse-Smale function). 

Given any point a; S K, we define the gradient ascent flow starting at x, 
TTa; : K"*' I—>■ K, by 

*-<“> = * 

=5(7r(t)). 

A particle on this flow moves along the gradient from x towards a “destination” 
given by 

dest(a;) = lim Trj;{t). 

t—¥00 

It can be shown that dest(a;) S C for x gK. 

We can thus partition K based on the value of dest(a:). These partitions are 
called descending manifolds in Morse theory (Morse, 1925; Milnor, 1963). Recall 
Ck is the k-th. order critical points, we assume Ck = {ck,i, ■ ■ ■ ,Ck,mk} contains 
ruk distinct elements. For each k, define 

Dk = {x: dest(x) G Cd-k} 

Dk,j = {x : dest(a;) = Cd-k,j} , j = 1 , • • • rud-k- 

That is, Dk is the collection of all points whose gradient ascent flow converges to 
a (d—/c)-th order critical point and Dkj is the collection of points whose gradient 
ascent flow converges to the j-th element of Cd-k - Thus, Dk = From 

Theorem 4.2 in Banyaga and Hurtubise (2004), each Dk is a disjoint union 
of fc-dimensional manifolds {Dkj is a A:-dimensional manifold). We call Dkj 
a descending k-manifold of /. Each descending k-manifold is a A:-dimensional 
manifold such that the gradient flow from every point converges to the same 
(d — fc)-th order critical point. Note that {Dq, • • • , Dk} forms a partition of K. 
The top panels of Figure 4 give an example of the descending manifolds for a 
two dimensional case. 

The ascending manifolds are similar to descending manifolds but are defined 
through the gradient descent flow. More precisely, given any a; G K, a gradient 
descent flow 7 : 1 R+ 1 —)■ K starting from x is given by 

7a; ( 0 ) = X 
7xW = -fl(7r(i))- 


( 4 ) 
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(c) (d) 

Fig 4. Two-dimensional examples of critical points, descending manifolds, ascending mani¬ 
folds, and 2-cells. This is the same function as Figure 1. (a): The set Ck for k = 0,1, 2. The 
four blue dots are Co, the collection of local modes (each of them is cqj some j = 1, - ■ ■ ,A). 
The four orange dots are Ci, the collection of saddle points (each of them is cij for some 
j = I,-- - ,4^. The green dots are C 2 , the collection of local minima (each green dot is C 2 j 
for some j = 1, • • • ,9). (b): The set for k = 0,1, 2. The yellow area is D 2 (each subregion 
separated by blue curves are D 2 ,j,j = I,-- - ,4/ The two blue curves are D\ (each of the 4 
blue segments are Dij,j = 1, • • • ,4J. The green dots are Dq (also C 2 ), the collection of local 
minima (each green dot is Dqj for some j = 1, - • • ,9). (b): The set for k = 0,1, 2. The 
yellow area is Aq (each subregion separated by red curves are = I,-- - ,9J. The two 

red curves are A\ (each of the 4 segments are Aij,j = 1, • • • ,4). The blue dots are A 2 
(also Co), the collection of local modes (each green dot is Aqj for some j = !,••• ,4). (d): 
Example for 2-cells. The thick blue curves are Di and thick red curves are Ai. 
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Unlike the ascending flow defined in (2), is a flow that moves along the 
gradient descent direction. The descent flow 7 ^; shares similar properties to the 
ascent flow -Kx] the limiting point lim(_>.oo 7 a;(t) S C is also in critical set when 
/ is a Morse function. Thus, similarly to Dk and Dk^j, we define 
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Ak and Ak^j have dimension d — k and each Akj is a partition for Ak and 
{Aq, • • • , All} consist of a partition for K. We call each Akj an ascending k- 
manifold to /. 

A smooth function / is called a Morse-Smale function if it is a Morse function 
and any pair of the ascending and descending manifolds of / intersect each 
other transversely (which means that pairs of manifolds are not parallel at their 
intersections); see e.g. Banyaga and Hurtubise (2004) for more details. In this 
paper, we also assume that / is a Morse-Smale function. Note that by the 
Kupka-Smale Theorem (see e.g. Theorem 6.6 in Banyaga and Hurtubise (2004)), 
Morse-Smale functions are generic (dense) in the collection of smooth functions. 
For more details, we refer to Section 6.1 in Banyaga and Hurtubise (2004). 

A k-cell (also called Morse-Smale cell or crystal) is the non-empty intersection 
between any descending /ci-manifold and an ascending {d — fc 2 )"ma'nifold such 
that k = minjfci, ^ 2 } (the ascending (d— fc 2 )-manifold has dimension ^ 2 ). When 
we simply say a cell, we are referring to the d-cell since d-cells consists of the 
majority of K (the totality of fc-cells with k < d has Lebesgue measure 0). The 
Morse-Smale complex for / is the collection of all fc-cells for k = 0, - ■ ■ ,d. The 
bottom panels of Figure 4 give examples for the ascending manifolds and the 
d-cells for d = 2. Another example is given in Figure 1. 

The cells of a smooth function can be used to construct an additive de¬ 
composition that is useful in data analysis. For a Morse-Smale function /, let 
El, - ■ ■ ,El be its associated cells. Then we can decompose / into 


L 
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where each fi{x) behaves like a multivariate isotonic function (Barlow et ah, 
1972; Bacchetti, 1989). Namely, f{x) = fi[x) when x S E^. This decomposition 
is because within each E^, f has exact a local mode and a local minimum on 
the boundary of Eg. The fact that / admits such a decomposition will be used 
frequently in Section 3.2 and 3.3. 

Among all descending/ascending manifolds, the descending d-manifolds and 
the ascending 0-manifolds are often of great interest. For instance, mode cluster¬ 
ing (Li et ah, 2007; Azzalini and Torelli, 2007) uses the descending d-manifolds 
to partition the domain IK into clusters. Morse-Smale regression (Gerber and 
Potter, 2011; Gerber et ah, 2013) fits a linear regression individually over each 
d-cell (non-empty intersection of pairs of descending d-manifolds and ascending 
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(a) Basins of attraction 



(b) Gradient ascent 



Fig 5. An example of mode clustering, (a): Basin of attraction for each local mode (red 
Black dots are data points, (b): Gradient flow (blue lines) for each data point. The gradient 
flow starts at one data point and ends at one local modes, (c): Mode clustering; we use the 
destination for gradient flow to cluster data points. 


0-manifolds). Regions outside descending d-manifolds or ascending 0-manifolds 
have Lebesgue measure 0. Thus, later in our theoretical analysis, we will focus 
on the stability of the set and Aq (see Section 4.1). We define boundaries of 
Dd as 

B = dDd = Dd-i U • • • U Dq. (7) 

The set B will be used frequently in Section 4. 

3. Applications in Statistics 
3.1. Mode Clustering 

Mode clustering (Li et ah, 2007; Azzalini and Torelli, 2007; Chacon and Duong, 
2013; Arias-Castro et ah, 2016; Chacon et al., 2015; Chen et al., 2016) is a 
clustering technique based on the Morse-Smale complex and is also known as 
mean-shift clustering (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu 
and Meer, 2002). Mode clustering uses the descending d-manifolds of the density 
function p to partition the whole space K. (Although the d-manifolds do not 
contain all points in K, the regions outside d-manifolds have Lebesgue measure 
0). See Figure 5 for an example. 

Now, we briefly describe the procedure of mode clustering. Let X = {Ai, • • • , A„ 
be a random sample from density p defined on a compact set K and assumed to 
be a Morse function. Recall that dest(a::) is the destination of the gradient ascent 
flow starting from x. Mode clustering partitions the sample based on dest(a;) 
for each point; specifically, it partitions X = Ai |J • • • IJ Xk such that 

Xi = {Xi e X : dest(Aj) = mg}, 

where each mg is a local mode of p. We can also view mode clustering as a cluster¬ 
ing technique based on the d-descending manifolds. Let Dd = Dd^i U '' ’ U B>d,L 
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be the d-descending manifolds of p, assuming that L is the number of local 
modes. Then each cluster = T” p| 

In practice, however, we do not know p so we have to use a density estimator 
Pn- A common density estimator is the kernel density estimator (KDE): 



( 8 ) 


where it' is a smooth kernel function and h > 0 is the smoothing parameter. 
Note that mode clustering is not limited to the KDE; other density estimators 
also give us a sample-based mode clustering. Based on the KDE, we are able to 
estimate gradient 'gn{x), the gradient flows and the destination dest„(a;) 

(note that the mean shift algorithm is an algorithm to perform these tasks). 
Thus, we can estimate the d-descending manifolds by the plug-in from Let 
Dd = Dd^i U ■ ■ ■ U L d-descending manifolds of p„, where L is the 

number of local modes of Pn- The estimated clusters will be Ai, • • • , Ag, where 
each A^ = A Pi Dd^g- Figure 5 displays an example of mode clustering using the 


KDE. 


A nice property of mode clustering is that there is a clear population quan¬ 
tity that our estimator (clusters based on the given sample) is estimating: the 
population partition of the data points. Thus we can consider properties of the 
procedure such as consistency, which we discuss in detail in Section 4.2. 

3.2. Morse-Smale Regression 

Let (A, K) be a random pair where A G K and Xi G K C M'^. Estimating the 
regression function m{x) = E[A|A = x] is challenging for d of even moderate 
size. A common way to address this problem is to use a simple regression function 
that can be estimated with low variance. For example, one might use an additive 
regression of the form m(x) = mj{xj) which is a sum of one-dimensional 
smooth functions. Although the true regression function is unlikely to be of this 
form, it is often the case that the resulting estimator is useful. 

A different approach, Morse-Smale regression (MSR), is suggested in Gerber 
et al. (2013). This takes advantage of the (relatively) simple structure of the 
Morse-Smale complex and the isotone behavior of the function on each cell. 
Specifically, MSR constructs a piecewise linear approximation to m{x) over the 
cells of the Morse-Smale complex. 

We first define the population version of the MSR. Let m(x) = E(A|A = x) 
be the regression function and is assumed to be a Morse-Smale function. Let 
El, - ■ ■ El be the d-cells for to. The Morse-Smale Regression for to is a piecewise 
linear function within each cell Eg such that 


?T^MSR(a::) = IJ-l + Pjx, for X G Eg, 


(9) 
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where are obtained by minimizing mean square error: 

=argmin E ((F - G Ei) 


argmin E {{Y - ft - l3^Xf\X £ E^) 


( 10 ) 


That is, tomsr is the best linear piecewise predictor using the d-cells. One can 
also view MSR as using a linear function to approximate fi in the additive 
model (6). Note that ttimsr is well defined except on the boundaries of that 
have Lebesgue measure 0. 

Now we define the sample version of the MSR. Let {Xi,Yi), • • • , (X„, F„) be 
the random sample from the probability measure Px x Pv such that G K C 
and Y^ £ K. Throughout section 3.2, we assume the density of covariates X 
is bounded, positive and has a compact support IK and the response Y has finite 
second moment. 

Let fhn be a smooth nonparametric regression estimator for m. We call 
the pilot estimator. For instance, one may use the kernel regression Nadaraya 



as El, - ■ ■ , Ej^. Using the data (Xi, Yi) within each estimated d-cell, Ei, the 
MSR for rhn is given by 


mn,MSR{x) ='p-t + '^x, for a: G 

where are obtained by minimizing the empirical squared error: 


( 11 ) 



( 12 ) 


This MSR is slightly different from the original version in Gerber et al. (2013). 
We will discuss the difference in Remark 1. Computing the parameters of MSR 
is not very difficult-we only need to compute the cell labels of each observation 
(this can be done by the mean shift algorithm or some fast variants such as the 
quick-shift algorithm Vedaldi and Soatto 2008) and then fit a linear regression 
within each cell. 

MSR may give low prediction error in some cases; see Gerber et al. (2013) for 
some concrete examples. In Theorem 5, we prove that we may estimate mMSR at 
a fast rate. Moreover, the regression function may be visualized by the methods 
discussed later. 

Remark 1 The original version of Morse-Smale regression proposed in Gerber 
et al. (2013) does not use d-cells of a pilot nonparametric estimate fhn- Instead, 
they directly find local modes and minima using the original data points {Xi,Yi). 
This saves computational effort but comes with a price: there is no clear popu¬ 
lation quantity being estimated by their approach. That is, when the sample size 
increases to infinity, there is no guarantee that their method will converge. In 
our case, we apply a consistent pilot estimate for m and construct d-cells on 
this pilot estimate. As is shown in Theorem 4, our method is consistent for this 
population quantity. 
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3.3. Morse-Smale Signatures and Visualization 

In this section we define a new method for visualizing multivariate functions 
based on the Morse-Smale complex, called Morse-Smale signatures. The idea is 
very similar to the Morse-Smale regression but the signatures can be applied to 
any Morse-Smale function. 

Let El,-- - ,Ex be the d-cells (nonempty intersection of a descending d- 
manifold and an ascending 0-manifold) for a Morse-Smale function / that has 
a compact support K. The function / depends on the context of the problem. 
For density estimation, / is the density p or its estimator pn- For regression 
problem, / is the regression function m or a nonparametric estimator fhn ■ 
For two sample test, / is the density difference pi — P 2 or the estimated density 
difference pi —p 2 ■ Note that Ei, - - - , Ek form a partition for K except a Lebesgue 
measure 0 set. Each cell corresponds to a unique pair of a local mode and a local 
minimum. Thus, the local modes and minima along with d-cells form a bipartite 
graph which we call it signature graph. The signature graph contains geometric 
information about /. See Figure 6 and 7 for examples. 

The signature is dehned as follows. We project the maxima and minima of 
the function into using multidimensional scaling. We connect a maximum 
and minimum by an edge if there exists a cell that connects them. The width 
of the edge is proportional to the norm of the linear coefficients of the linear 
approximation to the function within the cell. The linear approximation is 

fMs{x) =r]l-\- lY ^ ^ (13) 

where r/J € K and yj € are parameters from 

= argmin / (/(a;) - 77 - dx. (14) 

r),l J Et 

This is again a linear approximation for in the additive model ( 6 ) . Note that 
/ms may not be continues when we move from one cell to another. The summary 
statistics for the edge associated with cell Ei are the parameters {ril,Y)- We 
call the function /ms the (Morse-Smale) approximation function; it is the best 
piecewise-linear representation for / (piecewise linear within each cell) under £2 
error given the d-cells. This function is well-defined except on a set of Lebesgue 
measure 0 (the boundaries of each cell). See Figure 6 for a example on the 
approximation function. The details are in Algorithm 1 . 

Example. Figure 7 is an example using the GvHD dataset. We first conduct 
multidimensional scaling (Kruskal, 1964) on the local modes and minima for / 
and plot them on the 2-D plane. In Figure 7, the blue dots are local modes and 
the green dots are local minima. These dots act as the nodes for the signature 
graph. Then we add edges, representing the cells for / that connect pairs of local 
modes and minima, to form the signature graph. Lastly, we adjust the width 
for the edges according to the strength (£2 norm) of regression function within 
each cell (i.e. ||y|||). Algorithm 1 provides a summary for visualizing a general 
multivariate function using what we described in this paragraph. 
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(a) Original function (b) Approximation function 



1 2 3 4 5 6 


(c) Signature graph 


Fig 6. Morse-Smale signatures for a smooth function, (a): The original function. The blue 
dots are local modes, the green dots are local minima and the pink dot is a saddle point, (h): 
The Morse-Smale approximation to (a). This is the best piecewise linear approximation to the 
original function given d-cells. (c): This bipartite graph has nodes that are local modes and 
minima and edges that represent the d-cells. Note that we can summarize the smooth function 

(a) by the signature graph (c) and the parameters for constructing approximation function 

(b) . The signature graph and parameters for approximation function define the Morse-Smale 
signatures. 


Algorithm 1 Visualization using Morse-Smale Signatures 

Input: Grid points xi, • • • ,xn and the functional evaluations f{xi), • • • , fix^). 

1. Find local modes and minima of / on the discretized points xi, - • • ,xn. Let Mi, • • • Mk 
and mi, • • • , ms denote the grid points for modes and minima. 

2. Partition {xi, • • • , into Ai, • • • according to the d-cells of / (1. and 2. can be done 
by using a k-nearest neighbor gradient ascent/descent method; see Algorithm 1 in Gerber 
et al. (2013)). 

3. For each cell A/, fit a linear regression with (Aj, V) = (xi, f{xi)), where Xi E A^. Let 
the regression coefficients (without intercept) be 

4. Apply multidimensional scaling to modes and minima jointly. Denote their 2 dimensional 
representation points as 

5. Plot {M*, ■ ■ ■ M^, m^, ■ ■ ■ , mg}. 

6. Add edge to a pair of mode and minimum if there exist a cell that connects them. The 

width of the edge is in proportional to (for cell A^). 
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Fig 7. Morse-Smale Signature visualization (Algorithm 1) of the density difference for GvHD 
dataset (see Figure 2). The blue dots are local modes; the green dots are local minima; the 
brown lines are d-cells. These dots and lines form the signature graph. The width indicates 
the C 2 norm for the slope of regression coefficients, i.e. || 7 |||. The location for modes and 
minima are obtained by multidimensional scaling so that the relative distance is preserved. 


3.4- Two Sample Comparison 

The Morse-Smale complex can be used to compare two samples. There are two 
ways to do this. The first one is to test the difference in two density functions 
locally and then use the Morse-Smale signatures to visualize regions where the 
two samples are different. The second approach is to conduct a nonparametric 
two sample test within each Morse-Smale cell. The advantage of the first ap¬ 
proach is that we obtain a visual display on where the two densities are different. 
The merit of the second method is that we gain additional power in testing the 
density difference by using the shape information. 


3 . 4 . 1 . Visualizing the Density Difference 

Let Xi,... Xn and Li,..., be two random sample with densities px and py- 
In a two sample comparison, we not only want to know if px = py but we also 
want to find the regions that they significantly disagree. That is, we are doing 
the local tests 

Hoix) : px{x) = py{x) (15) 

simultaneously for all a; G IK and we are interested in the regions where we reject 
Hq{x). a common approach is to estimate the density for both sample by the 
KDE and set a threshold to pickup those regions that the density difference is 
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large. Namely, we first construct density estimates 

2 = 1 ^ ^ 2 = 1 ^ ^ 

and then compute f{x) = Pxi^) — Py{^)- The regions 

r(A) = {xeK:|7(x)|>A} (17) 

are where we have strong evidence to reject Hq{x). The threshold A can be 
picked by quantile values of the bootstrapped Coo density deviation to control 
type 1 error or can be chosen by controlling the false discovery rate (Duong, 
2013). 

Unfortunately, r(A) is hard to visualize when d > 3. So we use the Morse- 
Smale complex for / and visualize r(A) by its behavior on the d-cells of the 
complex. Algorithm 2 gives a method for visualizing density differences like 
r(A) in the context of comparing two independent samples. 


Algorithm 2 Visualization For Two Sample Test 

Input: Sample 1: {Xi, ...Xti}, Sample 2: {Yi, ■ ■ ■ , Ym}, threshold A and radius constant vq 

1. Compute the density estimates and py- 

2. Compute the difference function / = px — Py f-he significant regions 

r+(A) = p e K ;/(x) > a} , r-(A) = |x e K:/(x) <-a} (18) 

3. Find the d-cells for /, denoted as • • • , Ei,. 

4. For cell do (4-1) and (4-2): 

4-1. compute the cell center e^, cell size Vi = Vol(F'£), 

4-2. compute the positive significant ratio and negative significant ratio 

, _ Voi(i;,nr+(A)) _ Voi(E^nr-(A)) 

^ VolfBU ’ ^ yo\(Ei) ■ ^ ' 

5. For every pair of cell Ej and E^ {j ^ i), compute the shared boundary size: 

Bji = yo\d-i{Ej n Ep), (20) 

where Volcf_i is the d—1 dimensional Lebesgue measure. 

6. Do multidimensional scaling (Kruskal, 1964) to ei,--- , to obtain low dimensional 
representation ei, • • • , ej,. 

7. Place a ball center at each with radius ro x y/Vg. 

8. If 'f'J > 0, add a pie chart center at with radius vq x \/VJ x (r^ + The pie 

chart contains two groups, each with ratio 

9. Add a line to connect two nodes ej and if Bj^ > 0. We may adjust the thickness of 
the line according to Bj(^. 



An example for Algorithm 2 is in Figure 2, in which we apply the visualization 
algorithm for the the GvHD dataset by using kernel density estimator. We 
choose the threshold A by bootstrapping the Coo difference for f i.e. sup^ |/*(x) — 
/(x)|, where f* is the density difference for the bootstrap sample. We pick 
a = 95% npper quantile valne for the bootstrap deviation as the threshold. 
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The radius constant tq is defined by the user. It is a constant for visualiza¬ 
tion and does not affect the analysis. Algorithm 2 preserves the relative position 
for each cell and visualizes the cell according to its size. The pie-chart provides 
the ratio of regions where the two densities are significantly different. The lines 
connecting two cells provide the geometric information about how cells are con¬ 
nected to each other. 

By applying Algorithm 2 to the GvHD dataset (Figure 2), we find that there 
are 6 cells and one cell much larger than the others. Moreover, in most regions, 
the blue regions are larger than the red areas. This indicates that compared 
to the density of the control group, the density of the GvHD group seem to 
concentrates more so that the regions above the threshold are larger. 


3.4-2. Morse-Smale Two-Sample Test 


Here we introduce a technique combining the energy test (Baringhaus and Franz, 
2004; Szekely and Rizzo, 2004, 2013) and the Morse-Smale complex to conduct 
a two sample test. We call our method the Morse-Smale Energy test (MSE test). 
The advantage of the MSE test is that it is a nonparametric test and its power 
can be higher than the energy test; see Figure 8. Moreover, we can combine 
our test with the visualization tool proposed in the previous section (Algorithm 
2); see Figure 9 for an example for displaying p-values from MSE test when 
visualizing the density difference. 

Before we introduce our method, we first review the ordinary energy test. 
Given two random variables A G and E G the energy distance is defined 
as 

£{X,Y) = 2E||A-E|| -E|jA-A'|| -E||r-E'||, (21) 

where X' and Y' are iid copies of X and Y. The energy distance has several 
useful applications such as the goodness-of-fit testing (Szekely and Rizzo, 2005), 
two sample testing (Baringhaus and Franz, 2004; Szekely and Rizzo, 2004, 2013), 
clustering (Szekely and Rizzo, 2005), and distance components (Rizzo et al., 
2010) to name but few. We recommend an excellent review paper in (Szekely 
and Rizzo, 2013). 

For the two sample test, let Ai, • • • , A„ and Ei, • • • , Y^ be the two samples 
we want to test. The sample version of energy distance is 


j-.nm .nn -mm 

£(A, E) = — ^ ^ II A, - E, II - - ^ ^ II A, - A, II - — ^ ^ ||E, - Y, 


i=i j=i 


i=i j=i 


i=l 3 = 1 


( 22 ) 


^ p 

If A and E are from the sample population (the same density), £{X,Y) 0. 

Numerically, we use the permutation test for computing the p-value for £ (A, E). 
This can be done quickly in the R-package ‘energy’ (Rizzo and Szekely, 2008). 

Now we formally introduce our testing procedure: the MSE test (see Algo¬ 
rithm 3 for a summary). Our test consists of three steps. First, we split the data 
into two halves. Second, we use one half of the data (contains both samples) 
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to do a nonparametric density estimation (e.g. the KDE) and then compute 
the Morse-Smale complex (d-cells). Last, we use the other half of the data to 
conduct the energy distance two sample test ‘within each d-cell’. That is, we 
partition the second half of the data by the d-cells. Within each cell, we do the 
energy distance test. If we have L cells, we will have L p-values from the energy 
distance test. We reject Hq if any one of the L p-values is smaller than a/L 
(this is from Bonferroni correction). Figure 9 provides an example for using the 
above procedure (Algorithm 3) along with the visualization method proposed 
in Algorithm 2. Data splitting is used to avoid using the same data twice, which 
ensures we have a valid test. 


Algorithm 3 Morse-Smale Energy Test (MSE test) 

Input: Sample 1: {Xi, ...Xn}, Sample 2: {Xi, • • • , Ym}, smoothing parameter h, significance 
level a 

1. Randomly split the data into halves Di and T) 2 ‘, both contain equal number of X and Y 
(assuming n and m are even). 

2. Compute the KDE px and py by the first sample Di. 

3. Find the d-cells for / = px — Pv, denoted as Ei, • • • , E^. 

4. For cell do 4-1 and 4-2: 

4-1. Find X and Y in the second sample D 2 , 

4-2. Do the energy test for two sample comparison. Let the p-value be p{£) 

5. Reject Hq if p(i) < a/L for some £. 


Example. Figure 8 shows a simple comparison for the proposed MSE test to 
the usual Energy test. We consider a AT = 4 Gaussian mixture model in d = 2 
with standard deviation of each component being the same a — 0.2 and the 
proportion for each component is (0.2,0.5, 0.2,0.1). The left panel displays a 
sample with N = 500 from this mixture distribution. We draw the first sample 
from this Gaussian mixture model. For the second sample, we draw a similar 
Gaussian mixture model except that we change the deviation of one component. 
In the middle panel, we change the deviation to the third component (C3 in 
left panel, which contains 20% data points). In the right panel, we change the 
deviation to the fourth component (C4 in left panel, which contains 10% data 
points). We use significance level a = 0.05 and for MSE test, we consider the 
Bonferroni correction and the smoothing bandwidth is chosen using Silverman’s 
rule of thumb (Silverman, 1986). 

Note that in both the middle and the right panels, the left most case (added 
deviation equals 0) is where Hq should not be rejected. As can be seen from 
Figure 8, the MSE test has much stronger power compared to the usual Energy 
test. 

The original energy test has low power while the MSE test has higher power. 
This is because the two distributions only differ at a small portion of the regions 
so that a global test like energy test requires large sample sizes to detect the 
difference. On the other hand, the MSE test partitions the space according to 
the density difference so that it is capable of detecting the local difference. 

Example. In addition to the higher power, we may combine the MSE test 
with the visualization tool in Algorithm 2. Figure 9 displays an example where 
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Change Deviation of C3 



Change Deviation of C4 



Fig 8. An example comparing the Morse-Smale Energy test to the original Energy test. We 
consider a d = 2, K = 4 Gaussian mixture model. Left panel: an instance for the Gaussian 
mixture. We have four mixture components, denoting as Gl, G2, G3 and G4- They have equal 
standard deviation (cr = 0.2) and the proportions for each components are (0.2,0.5, 0.2, 0.1). 
Middle panel: We changed the standard deviations of component G3 to 0.3, 0.4 and 0.5 and 
compute the power for the MSE test and the usual Energy test at sample size N = 500 and 
1000. (Standard deviation equals 0.2 is where Hq should not be rejected.) Right panel: We 
add the variance of component G4 (the smallest component) and do the same comparison as 
in the middle panel. We pick the significance level a = 0.05 (gray horizontal line) and in the 
MSE test, we reject Hq if the minimal p-value is less than a/L, where L is the number of 
cells (i.e. we are using the Bonferroni correction). 


we visualize the density difference and simultaneously indicate the p-values from 
the Energy test within each cell using the GvHD dataset. This provides us more 
information about how two distributions differ from each other. 

4. Theoretical Analysis 

We first define some notation for the theoretical analysis. Let / be a smooth 
function. We define ||/||oo = sup^ |/(^)| to be the £oo-norm of /. In addition, let 
II/II j,max denote the elementwise £oo-iiorm for j-tYv derivatives of /. For instance, 

II / II l,max — max |j^2(^)||cxD5 ll/ll 2, max — maX || ( 2 ^) || cxd ■ 

We also define ||/||o,max = II/II 00 • We further define 

ll/lll,max = niax{||/||j-„iax : j = 0, •••,€} . (23) 

The quantity ||/ — ^|||max measures the difference between two functions / and 
h up to £-th order derivative. 

For two sets A, B, the Hausdorff distance is 

Haus(A, B) = inf{r : A C B (B r, B C A (B r}, (24) 

where A(Br = {y : niina;gA \\x — y\\ < r}. The Hausdorff distance is like the £oo 
distance^ for sets. 

Let / : K C I—)■ M be a smooth function with bounded third derivatives. 
Note that as long as ||/—/Ha^max is small, / is also a Morse function by Lemma 9. 
Let D denote the boundaries of the descending d-manifolds of /. We will show 
if 11/ - /lls.maxis sufficiently small, then Haus(D,D) = 0(|i/- /||i.max)- 
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Fig 9. An example using both Algorithm 2 and 3 to the GvHD dataset introduced in Figure 2. 
We use data splitting as described in Algorithm 3. For the first part of the data, we compute 
the cells and visualize the cells using Algorithm 2. Then we apply the energy distance two 
sample test for each cell as described in Algorithm 3 and we annotate each cell with a p- 
value. Note that the visualization is slightly different to Figure 2 since we use only half of the 
original dataset in this case. 


4 . 1 . stability of the Morse-Smale Complex 

Before we state our theorem, we first derive some properties of descending mani¬ 
folds. Recall that we are interested in B = dDd, the boundary of the descending 
d-manifolds (and B is also the union of all j-descending manifolds for j < d). 
Since each Dj is a collection of smooth j-dimensional manifolds embedded in 
for every x G Dj, there exists a basis ui(a;), • • • , Vd-j{x) such that each Vk{x) 
is perpendicular to Dj at a; for fc = 1, • • • d — j (Bredon, 1993; Helgason, 1979). 
That is, ui(a:), • • • , Vd-j{x) span the normal space to Dj at x. For simplicity, we 
write 

V{x) = (ui(a:), • • • ,Vd-j{x)) G (25) 

for X £ Dj. 

Note the number of columns d — j = d — j (x) in V(x) depends on which Dj 
the point x belongs to. We use j rather than j{x) to simplify the notation. For 
instance, if a; G I?i, V{x) G and if x G Dd-i, V(x) G We also 

let 

V(x) = span{wi(x),-• • ,Vd-j{x)} (26) 

denote the normal space to B at x. One can view V(x) as the normal map of 
the manifold Dj at x G Dj. 

For each x G R, define the projected Hessian 

Hv{x) = V{xf H{x)V{x), (27) 

which is the Hessian matrix of p by taking gradients along column space of 
V{x). If X G Dj, Hv{x) is a {d — j) x (d — j) matrix. The eigenvalues of Hv{x) 
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determine how the gradient flows are moving away from B. We let Amin(M) be 
the smallest eigenvalue for a symmetric matrix M. If M is a scalar (just one 
point), then Aniin(-^) = Af. 


Assumption (D): We assume that ifmin = uiina;^^ > 0. 


This assumption is very mild; it requires that the gradient flow moves away 
from the boundary of ascending manifolds. In terms of mode clustering, this 
requires the gradient flow to move away from the boundaries of clusters. For a 
point X € Dd-i, let 1 ^ 1 ( 0 ;) be the corresponding normal direction. Then the gradi¬ 
ent g{x) is normal to vi{x) by definition. That is, vi^xY"g{x) = vi{x)'^Wp{x) = 
0, which means that the gradient along ni(a;) is 0. Assumption (D) means that 
the the second derivative along vi{x) is positive, which implies that the density 
along direction vi{x) behaves like a local minimum at point x. Intuitively, this 
is how we expect the density to behave around the boundaries: gradient flows 
are moving away from the boundaries (except for those flows that are already 
on the boundaries). 

Theorem 1 (Stability of descending d-manifolds) Let /, / : K C !-)■ K 
be two^smooth functions with bounded third derivatives defined as above and 
let B, B be the boundaries of the associated ascending mcmifolds. Assume f is 
a Morse function satisfying condition (D). When ||/ — /Hamax sufficiently 
small, ^ 

Haus(B,B) = 0(||/-/||i,,,ax)- (28) 


This theorem shows that the boundaries of descending d-manifolds for two Morse 
functions are close to each other and the difference between the boundaries is 
controlled by the rate of the first derivative difference. 

Similarly to descending manifolds, we can define all the analogous quantities 
for ascending manifolds. We introduce the following assumption: 


Assumption (A): We assume Bmax = naa-Xx^dAg Amax(AIv(a;)) < 0. 


Note that Aniax(Af) denotes the largest eigenvalue of a matrix M. If M is a 
scalar, Xmax{M) = M. Under assumption (A), we have a similar stability result 
(Theorem 1) for ascending manifolds. Assumptions (A) and (D) together imply 
the stability of d-cells. 

Theorem 1 can be applied to nonparametric density estimation. Our goal is to 
estimate the boundary of the descending d-manifolds, B, of the unknown popu¬ 
lation density function p. Our estimator is B„, the boundary of the descending 
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d-manifolds to a nonparametric density estimator e.g. the kernel density esti¬ 
mate Pn- Then under certain regularity condition, their difference is given by 

Haus = 0 {\\pn-p\\i ,max) ■ 

We will see this result in the next section when we discuss mode clustering. 

Similar reasoning works for the nonparametric regression case. Assume that 
we are interested in B, the boundary of descending d-manifolds, for the regres¬ 
sion function m{x) = E(y|A = x). And our estimator B is again a plug-in 
estimate based on m„{x), a nonparametric regression estimator (e.g., kernel 
estimator). Then under mild regularity conditions, 

Haus (^Bn, B^ = O (||m„ - m\\ i^max) • 


4-2. Consistency of Mode Clustering 


A direct application of Theorem 1 is the consistency of mode clustering. Let 
be the a-th derivative of K and let BC’’ denote the collection of functions 
with bounded continuously derivatives up to the r-th order. We consider the 
following two assumptions on the kernel function: 

(Kl) The kernel function K £ BC^ and is symmetric, non-negative and 

for all a = 0,1, 2, 3. 

(K2) The kernel function satisfies condition Ki of Gine and Guillou (2002). That 

is, there exists some A,v > 0 such that for all 0 < e < 1, supg N{IC, L 2 {Q), Cks) 
(y)" , where N(T,d,e) is the e—covering number for a semi-metric space 
(T, d) and 

JC= :a:GK'^,h>0, |a| =0,1,2 


(Kl) is a common assumption; see Wasserman (2006). (K2) is a weak assump¬ 
tion guarantee the consistency for KDE under Lao norm; this assumption first 
appeared in Gine and Guillou (2002) and has been widely assumed (Einmahl 
and Mason, 2005; Rinaldo et ah, 2010; Genovese et al., 2012; Rinaldo et ah, 
2012; Genovese et ah, 2014; Chen et ah, 2015). 


Theorem 2 (Consistency for mode clustering) Letp,pn he the density func¬ 
tion and the KDE. Let B and Bn be the boundaries of clusters by mode clus¬ 
tering over p and Pn respectively. Assume (D) for p and (Kl-2), then when 
0, h —)• 0, 


log n 


ih^+^ 


Haus [Bu.B^ = 0 {\\pn -plli.max) = 0 {h^) + Op 


( I \og{n) \ 

yV nh'^+‘^ j 
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The proof is simply to combine Theorem 1 and the rate of convergence for 
estimating the gradient of density using KDE (Theorem 8). Thus, we omit the 
proof. Theorem 2 gives a bound for the rate of convergence for the boundaries 
for mode clustering. The rate can be decomposed into two parts, the bias 0{h'^) 

and the (square root of) variance Op ■ This rate is the same for the 

>Coo“loss of estimating the gradient of a density function, which makes sense 
since the mode clustering is completely determined by the gradient of density. 

Another way to describe the consistency for mode clustering is to show that 
the proportion of data points that are incorrectly clustered (mis-clustered) con¬ 
verges to 0. This can be quantified by the use of Rand index (Rand, 1971; Hubert 
and Arabie, 1985; Vinh et ah, 2009), which measures the similarity between two 
partitions of the data points. Let dest(a;) and destn(x) be the destination of 
gradient of the true density function p{x) and the KDE Pnix)- For a pair of 
points X, y, we define 


T- . ^ f 1 if dest(x) = dest('i/) o . ^ I 1 if dest„(x) = dest„(v) 

= ifd5„W^dSrM») 

(29) 

Thus, 'I'(x, y) = 1 a x,y are in the same cluster and 0 if they are not. The Rand 
index for mode clustering using p versus using pn is 


rand (p„,p) = 1 - ^ |«'(Ai, A^-) - $„(Xi, A^-) 


(30) 


which is the proportion of pairs of data points that the two clustering results 
disagree on. If two clusterings output the same partition, the Rand index will 
be 1. 

Theorem 3 (Bound on Rand Index) Assume (D) for p and (Kl-2). Then 
when —>■ 0, h —)■ 0, the adjusted Rand index 


rand {pn,p) = 1 - 0{h‘^) - Or 


( I log{n) \ 
nh^+'^ j 


Theorem 3 shows that the Rand index converges to 1 in probability, which 
establishes the consistency of mode clustering in an alternative way. Theo¬ 
rem 3 shows that the proportion of data points that are incorrectly assigned 
(compared with mode clustering using population p) is bounded by the rate 

0{h'^)-\-Or asymptotically. 

Azizyan et al. (2015) also derived the convergence rate of the mode clustering 
for the rand index. Here we briefly compare our results to theirs. Azizyan et al. 
(2015) consider a low-noise condition that leads to a fast convergence rate when 
clusters are well-separated. Their approach can even be applied to the case of 
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increasing dimensions. In our case (Theorem 3), we consider a fixed dimension 
scenario but we do not assume the low-noise condition. Thus, the main difference 
between Theorem 3 and the result in Azizyan et al. (2015) is the assumptions 
being made so our result complements the findings in Azizyan et al. (2015). 


4-3. Consistency of Morse-Smale Regression 


In what follows, we will show that m„_ivisR(a;) is a consistent estimator of mMSR(a;)- 
Recall that 

?nMSR(a^) = ye + Pjx, for x G Ee, (31) 

where Ei is the d-cell defined on m and the parameters are 

iye,Pe) = argmin E {{Y - y - (3'^Xf\X G Ee) . ( 32 ) 

And fhn, MSR is the two-stage estimator to mMSR(a^) defined by 

mn,MSR(a;) =ye + '^x, for a; G Ei, (33) 


where {E^ : £ = 1 , • • • , L} are the collection of cells of the pilot nonparametric 
regression estimator and yg, j3i are the regression parameters from equation 
( 12 ): 


{fie, Pe) = argmin 


^ {Y,-y-P^Xif. 

i'-XiGEf 


(34) 


Theorem 4 (Consistency of Morse-Smale Regression) Assume (A) and 
(D) for m and assume m is a Morse-Smale function. Then when 
0 , h —>■ 0 , we have 


|wMSR(a;) 


Wn,MSR(a^)| = Op 



+ 0 (||m„ 


^11 l,max) 


(35) 


uniformly for all x except for a set N„ with Lebesgue measure Op(||m„—m||i_niax), 

Theorem 4 states that when we have a consistent pilot nonparametric regression 
estimator (such as the kernel regression), the proposed MSR estimator converges 
to the population MSR. Similarly as in Theorem 6 , the set N„ are regions around 
the boundaries of cells where we cannot distinguish their host cell. Note that 
when we use the kernel regression as the pilot estimator to„. Theorem 4 becomes 


|wMSR(a;) - m„,MSR(a;)| = 0{h^) + Op 

under regular smoothness conditions. 

Now we consider a special case where we may obtain parametric rate of 
convergence for estimating tumsr- Let £ = d {Ei IJ • • • IJ El) be the boundaries 
of all cells. We consider the following low-noise condition: 

P(A G £0e) < Ae^, 


( I log^' 


(36) 





Chen et al./Inference using the Morse-Smale 


23 


for some A,/3 > 0. Equation (36) is Tsybakov’s low noise condition (Audibert 
et al., 2007) applied to the boundaries of cells. Namely, (36) states that it is 
unlikely to many observations near the boundaries of cells of m. Under this 
low-noise condition, we obtain the following result using kernel regression. 


Theorem 5 (Fast Rate of Convergence for Morse-Smale Regression) 

Let the pilot estimator fhn he the kernel regression estimator. Assume (A) 
and (D) for m and assume m is a Morse-Smale function. Assume also (36) 
holds for the covariate X and (Kl-2) for the kernel function. Also assume that 

— Q I I j j _ j’/jgj.j uniformly for all x except for a set with 

Lebesgue measure Op 



|»T^MSR(a^) — mn,MSR(a:)| = Op + Op 

Therefore, when ft > , we have 

|TOMSR(a;) - m.„^MSR(a::)| = Op 


logny^/(‘^+^)^ 


(37) 


(38) 


Theorem 5 shows that when the low noise condition holds, we obtain a fast 
rate of convergence for estimating mMSR- Note that the pilot estimator m„ does 
not ahve to be a kernel estimator; other approaches such as the local polynomial 
regression will also work. 


f.f. Consistency of the Morse-Smale Signature 

Another application of Theorem 1 is to bound the difference of two Morse- 
Smale signatures. Let / be a Morse-Smale function with cells Ei,..., E^. Recall 
that the Morse-Smale signatures are the bipartite graph and summary statistics 
(locations, density values) for local modes, local minima, and cells. It is known in 
the literature (see, e.g.. Lemma 9) that when two functions /, / are sufficiently 
close, then 

max||dj- - CjW = O (\\f- /111,max) , max||/(cj) - /(cj)|| = O (\\f- /||oo) , 

(39) 

where Cj , Cj are critical points / and / respectively. This implies the stability of 
local modes and minima. 

So what we need is the stability of the summary statistics ( 77 ], 7 ]) associated 
with the edges (cells). Recall that these summaries are defined through (14) 

= argmin / {f{x)-r]-j'^xfdx. 

1,1 J Et 
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For another function /, let (vfll) be its signatures for cell Eg. The following 
theorem shows that if two functions are close, their corresponding Morse-Smale 
signatures are also close. 

Theorem 6 Let f be a Morse-Smale function satisfying assumptions A and D, 
and let f be a smooth function. Then when t 0, —>■ 0, after relabeling 

the indices of cells of f, 


Theorem 6 shows stability of the signatures Note that Theorem 6 

also implies that the stability of piecewise approximation 

\fus{x)-Jus{x)\=o{\\J-f\\l^^/. 

Together with the stability of critical points (39), Theorem 6 proves the stability 
of Morse-Smale signatures. 

4.4-1. Example: Morse-Smale Density Estimation 

As an example for Theorem 6 , we consider density estimation. Let p be the 
density of random sample Xi , • • • , Al„ and recall that Pn is the kernel density 
estimator. Let be the signature for p under cell Eg and be the 

signature for under cell Eg. The following corollary guarantees the consistency 
of Morse-Smale signatures for the KDE. 

Corollary 7 Assume (A,D) holds for p and the kernel function satisfies (Kl- 
2). Then when t 0, /i —)■ 0, after relabeling we have 

max p \\\, llyj - y]||} = 0 {hf) + Op ■ 


The proof to Corollary 7 is a simple application of Theorem 6 with the rate of 
convergence for the first derivative of the KDE (Theorem 8 ). So we omit the 

proof. The optimal rate in Corollary 7 is Op when we choose h to 

be of order O . 

Remark 2 When we compute the Morse-Smale approximation function, we 
may have some numerical problem in low-density regions because the density 
estimate Pn may have unbounded support. In this case, some cells may be un¬ 
bounded, and the majority of these cells may have extremely low density value, 
which makes the approximation function 0. Thus, in practice, we will restrict 
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ourselves only to the regions whose density is above a pre-defined threshold A so 
that every cell is bounded. A simple data-driven threshold is X = 0.05 sup^ 

Note that Theorem 7 still works in this case but with a slight modification: the 
cells are define on the regions {x : Ph{x) > 0.05 x sup 2 ,ph(a:)}. 

Remark 3 Note that for a density function, local minima may not exist or the 
gradient flow may not lead us to a local minimum in some regions. For instance, 
for a Gaussian distribution, there is no local minimum and except for the center 
of the Gaussian, if we follow the gradient descent path, we will move to infinity. 
Thus, in this case we only consider the boundaries of ascending 0-manifolds 
corresponding to well-defined local minima and assumptions (A) is only for the 
boundaries corresponding to these ascending manifolds. 

Remark 4 When we apply the Morse-Smale complex to nonparametric density 
estimation or regression, we need to choose the tuning parameter. For instance, 
in the MSR, we may use kernel regression or local polynomial regression so we 
need to choose the smoothing bandwidth. For the density estimation problem 
or mode clustering, we need to choose the smoothing bandwidth for the kernel 
smoother. In the case of regression, because we have the response variable, we 
would recommend to choose the tuning parameter by cross-validation. For the 
kernel density estimator (and mode clustering), because the optimal rate depends 
on the gradient estimation, we recommend choosing the smoothing bandwidth 
using the normal reference rule for gradient estimation or the cross-validation 
method for gradient estimation (Duong et al., 2007; Ghacon et al., 2011). 

5. Discussion 

In this paper, we introduced the Morse-Smale complex and the summary sig¬ 
natures for nonparametric inference. We demonstrated that the Morse-Smale 
complex can be applied to various statistical problems such as clustering, re¬ 
gression and two sample comparisons. We showed that a smooth multivariate 
function can be summarized by a few parameters associated with a bipartite 
graph, representing the local modes, minima and the complex for the underly¬ 
ing function. Moreover, we proved a fundamental theorem about the stability 
of the Morse-Smale complex. Based on the stability theorem, we derived con¬ 
sistency for mode clustering and regression. 

The Morse-Smale complex provides a method to synthesize both paramet¬ 
ric and nonparametric inference. Compared to parametric inference, we have a 
more flexible model to study the structure of the underlying distribution. Com¬ 
pared to nonparametric inference, the use of the Morse-Smale complex yields a 
visualizable representation for the underlying multivariate structures. This re¬ 
veals that we may gain additional insights in data analysis by using geometric 
features. 

Although the Morse-Smale complex has many potential statistical applica¬ 
tions, we need to be careful when applying it to a data set whose dimension 
is large (say d > 10). When the dimension is large, the curse of dimensionality 
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kicks in and the nonparametric estimators (in both density estimation problems 
or regression analysis) are not accurate so the errors of the estimated Morse- 
Smale complex can be huge. 

Here we list some possible extensions for future research: 

• Asymptotic distribution. We have proved the consistency (and the rate of 
convergence) for estimating the complex but the limiting distribution is 
still unknown. If we can derive the limiting distribution and show that 
some resampling method (e.g. the bootstrap Efron (1979)) converges to 
the same distribution, we can construct confidence sets for the complex. 

• Minimax theory. Despite the fact that we have derived the rate of con¬ 
vergence for a plug-in estimator for the complex, we did not prove its 
optimality. We conjecture the minimax rate for estimating the complex 
should be related to the rate for estimating the gradient and the smooth¬ 
ness around complex (Audibert et ah, 2007; Singh et ah, 2009). 
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Appendix A: Appendix: Proofs 

First, we include a Theorem about the rate of convergence for the kernel density 
estimator. This Lemma will be used in deriving the convergence rates. 

Theorem 8 (Lemma 10 in Chen et al. (2015); see also Genovese et al. (2014)) 

Assume (Kl-2) and that logn/n < h'^ < b for some 0 < 6 < 1. Then we have 

IP.= 0(1.^) + o, 


/or f = 0,1,2. 

To prove Theorem 1, we introduce the following useful Lemma for stability 
of critical points. 

Lemma 9 (Lemma 16 of Chazal et al. (2014)) Let p be a density with com¬ 
pact support K. o/K'^. Assume p is a Morse function with finitely many, distinct, 
critical values with corresponding critical points C = {ci, • • • ,Cfc}. Also assume 
that p is at least twice differentiable on the interior of K, continuous and dif¬ 
ferentiable with non vanishing gradient on the boundary o/K. Then there exists 
eo > 0 such that for all 0 < e < cq the following is true: for some positive 
constant c, there exists rj > ccq such that, for any density q with support K 
satisfying Up - glia,max < V, we have 

1. q is a Morse function with exact k critical points c(, • • • , cj. and 
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Fig 10. Diagram for lemmas and Theorem 1. 


2. after suitable relabeling the indiees, maxj^i... \\ci — c'|| < e. 

Note that similar result appears in Theorem 1 of Chen et al. (2016). This lemma 
shows that two close Morse functions p, q will have similar critical points. 

The proof of Theorem 1 requires several working lemmas. We provide a chart 
for how we are going to prove Theorem 1. 

First, we define some notations about gradient flows. Recall that G K 
is the gradient (ascent) flow starting at x: 

7r^(0) = x, Tr^(t) = g(Tr^(t)). 

For X that is not on the boundary set we define the time: 

te(x) = infjt : 7ra;(s) S B{m, ^/e), for alls > t}, 

where m is the destination of Tr^,. That is, tg{x) is the time to arrive the regions 
around a local mode. 

First, we prove a property for the direction of the gradient field around bound¬ 
aries. 

Lemma 10 (Gradient field and boundaries) Assume the notations in The¬ 
orem 1 and assume f is a Morse function with bounded third derivatives and 
satisfies assumption (D). Let s{x) = x — 11^, where G B is the projected 
point from x onto B (when flj, is not unique, just pick any projected point). For 
any q G B, let x be a point near q such that x — q GY{q), the normal space of 
B at q. Let 6{x) = ||a: — g|| and e{x) = ]|f5||y denote the unit vector. Then 
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Fig 11. Illustration for Lemma 10 and 11. (a): We show that the angle between projection 
vector s{x) and the gradient g{x) is always right whenever x is closed to the boundaries B. (b): 
According to (a), any gradient flow line start from a point x that is close to the boundaries 
(distance < 5i), this flow line is always moving away from the boundaries when the current 
location is close to the boundaries. The flow line can temporally get closer to the boundaries 
when it is away from boundaries (distance > 5i) 


1. For every point x such that 


d{x, B) < 6i = 


2H„ 


(P 


\ 3,max 


we have 


g{xfs(x) > 0. 


That is, the gradient is pushing x away from the boundaries. 
2. When 


£(x) = e{xY'g{x) > -H^^^S^x). 


Proof. 

Claim 1. Because the projection of x onto B is IIj,, s{x) € V(n 2 :) and 
s(a:)^(/(n 2 ;) = 0 (recall that for p G B, V(p) is the collection of normal vectors 
of B at p). 

Recall that d{x,B) = ||s(x)|| is the projected distance. By the fact that 
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s{x)'^g{Il^) = 0, 


s{x)'^g{x) 


= s{xf {g{x) - g(n^)) 

> s{x)'^H{Il^)s{x) - y I 

>d{x,Bf 


fh,ma.^d{x,Bf 
s{x) d? 


d{x,B) 2"''" 
ni3.maxd{a;, S) j . 


(Taylor’s theorem) 

3,max^(^; B^ 


(40) 

Note that we use the vector-value Taylor’s theorem in the first inequality and 
the fact that for two close points x, y, the difference in the j-the element of 
gradient gj{x) — gj{y) has the following expansion 


gjix)-gjiy) = Hj{yf{x 


> Hj{yf{x 


> Hj{yf{x 


y)+ {u- y)Tj{u)du 

Ju^y 

y) - ^sup||Tj(u)||2||a:-j/f 

y) - yll/IU.maxlk- yf, 


where Hj{y) = ’^gjiy) and Tj{y) = Wgj{y) is the Hessian matrix of gjiy), 
whose elements are the third derivatives of /(y). 

Thus, when d(x, B) < , s{x)"^g{x) > 0, which proves the first claim. 

Claim 2. By definition, e{x)^g{q) = 0 because g{q) is in tangent space of B 
at q and e{x) is in the normal space of B at q. Thus, 


iix) = e{xf' g{x) 

= e{xf {g{x) - g{q)) 

> e{xfH{q){x - q) - y ||/|| 3 .max||a: - qf 
= e{x)'^H{TT{x))e{x)6{x) - y ||/|| 3 .max^(a;)^ 

whenever 6{x) = \\x — y|| < —• Note that in the first inequality we use 

the same lower bound as the one in claim 1. Also note that x — q = e(x)S(x) 
and e(x) is in the normal space of B at 7r(x) so the third inequality follows from 
assumption (D). 

□ 


Lemma 10 can be used to prove the following result. 






Chen et al./Inference using the Morse-Smale 


30 


Lemma 11 (Distance between flows and boundaries) Assume the nota¬ 
tions as the above and assumption (D). Then for all x such that 0 < d(a;, B) — 
A < A. - -_ 

d{TT^{t),B) > S, 


for all t > 0. 


The main idea is that the projected gradient (gradient projected to the normal 
space of nearby boundaries) is always positive. This means that the flow cannot 
move “closer” to the boundaries. 


Proof. By Lemma 10, for every point x near to the boundaries {d{x,B) < 
^i), the gradient is moving this point away from the boundaries. Thus, for any 
flow TTxft), once it touches the region B 0 i5i, it will move away from this region. 
So when a flow leaves B (B Si, it can never come back. 

Therefore, the only case that a flow can be within the region B (B Si is that 
it starts at some x £ B (B Si. i.e. d{x,B) < (5i. 

Now consider a flow start at x such that 0 < d{x,B) < (5i. By Lemma 10, 
the gradient g{x) leads x to move away from the boundaries B. Thus, whenever 
■Kxit) £ B(BSi, the gradient is pushing n^it) away from B. As a result, the time 
that TTxit) is closest to B is at the beginning of the flow .i.e. t = 0. This implies 
that d{'Kx{f),B) > d{TTx{Q),B) = d{x,B) = S. 

D 


With Lemma 11, we are able to bound the low gradient regions since the 
flow cannot move infinitely close to critical points except its destination. Let 
Amin > 0 be the minimal ‘absolute’ value of eigenvalues of all critical points. 

Lemma 12 (Bounds on low gradient regions) Assume the density func¬ 
tion f is a Morse function and has bounded third derivatives. Let C denote 
the collection of all critical points and let Amin Is the minimal ‘absolute ’ eigen¬ 
value for Hessian matrix H{x) evaluated at x £ C. Then there exists a constant 
^2 > 0 such that 


G{S)^ 


X ■ || 5 (a:)|| < 



C C0(5 


(42) 


for every S < S2. 

Proof. Because the support IK is compact and x € IK >-)• || 5 (x)|| is contin¬ 
uous, for any (70 > 0 sufficiently small, there exists a constant R{go) > 0 such 
that 

^ 1 ( 50 ) = {x : ||ff(a;)|| < 50 } C C 0 R{go) 
and when go -£ 0 , R{go) —t 0 . Thus, there is a constant 51 > 0 such that 
= 2dhi/Tkmax ■ 
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The set C © — has a useful feature: for any x & C ^ ^ 


211/113,„ 


\H{x) - H{c)\\f = ||(a; - c)f^^\c + t{x - c))dt\\j 


f 

<d^ 


< d'^lla:- c||||/||3,max 

'^min II /. 


2d3||/ll3.. 


I 3,max 


A„ 


where is a, d x d x d array of the third derivative of / and ||A||f is the 
Frobenius norm of the matrix A. By Hoffman-Wielandt theorem (see, e.g., page 
165 of Bhatia 1997), the eigenvalues between H{x) and H{c) is bounded by 
\\H{x) — H{c)\\f. Therefore, the smallest eigenvalue of H{x) must be greater 
than or equal to the smallest eigenvalue of H{c) minus Amm. Because A^in is 
the smallest absolute eigenvalues of H{c) for all c S C, the smallest eigenvalue 
of H{x) is greater than or equal to Amin ^ Jqj. all x € C © R{gi) = C © 2 ^ 3 |f /||3 -• 

Using the above feature and the fact that Gi{gi) C C © 2 d 3\/\\3 -’ 

X G Gi(gi), we have the following inequalities: 

91 > 115(3^)11 

r-l 

{x — c)H{c + t{x — c))dt 


/ 


— 11^ ^11 2 Amin- 


Thus, ||x — c|| < which implies 




2 gi 

Amin 


Moreover, because Gi{g 2 ) C Gi(g 3 ) for any 52 < 53 , any 52 < gi satisfies 

2^2 


Gi(g2)cC< 


An 


Now pick S = y-^, we conclude 


Gi 


26 


for all 


6 = ^< 
'^min 


= G{6) c C i 

2 gi 


Amir 


= ^2 


( 43 ) 


where gi is the constant such that R{gi) = 2 d 3||/||3 

□ 
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Fig 12. Illustration for 'H(e, <5). The thick black lines are boundaries B; solid dots are local 
modes; box is local minimum; empty dots are saddle points. The three purple lines denote 
possible gradient flows starting from some points x with d(x, B) = 6. The gray disks denote 
all possible regions such that H^H < 5. Thus, the amount of gradient within the set'H{e,S) 

is greater or equal to <5. 


Lemma 13 (Bounds on gradient flow) Using the notations above and as¬ 
sumption (D), let (5i be defined in Lemma 11 and 62 be defined in Lemma 12, 
equation (43). Then for all x such that 


d{x,B) = S <So = mini 61 , 62 , 
and picking e such that 62 > e^ > 6 , we have 

»7e(a:^)= inf ||ff(7r:r(i))|| > <5^^. 

Moreover, 

le{6) = inf pfix) > 6^^, 
xGBs Z 

where Bs = {x : d(x,B) = (5}. 


Proof. 

We consider the flow 'Kx starting at x (not on the boundaries) such that 
d{x, B) = 6 < min {di, 62 } ■ 

For 0 < t < tfix), the entire flow is within the set 


Hie, 6) = {x : d{x,B) > 5,d{x,M) > 


( 44 ) 
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That is, 

{iTxit) : 0 <t <t^{x)} C'H{e,S). (45) 

This is because by Lemma 11, the flow line cannot get closer to the boundaries 
B within distance <5, and the flow stops when its distance to its destination is 
at e. Thus, if we can prove that every point within 'H(e, i5) has gradient lowered 
bounded by we have completed the proof. That is, we want to show that 

Jlff(2;)|| > <5%^. (46) 

xen/.s) 2 

To show the lower bound, we focus on those points whose gradient is small. 
Let 

S{S) = ja; : |!g(a;)|| < . 

By Lemma 12, the S{S) are regions around critical points such that 

S{6) cC®S. 

Since we have chosen e such that e > and by the fact that critical points 
are either in M, the collection of all local modes, or in B the boundaries so that, 
the minimal distance between 'H(e, S) and critical points C is greater that 6 (see 
equation (44) for the definition of H(e, <5)). Thus, 

(C0(5)n'H(e,(5) = 0, 


which implies equation (46): 


inf llff(^)ll><5^ 

xen/.s) 2 


Now by the fact that all with d{x,B) < 6 are within the set T-L{e,5) 

(equation (45)), we conclude the result. 

□ 


Lemma 13 links the constant 7e(5) and the minimal gradient, which can be 
used to bound the time t^ix) uniformly and further leads to the following result. 

Lemma 14 Let K(5) = {a: € K : d{x, B) > 6} = K\(i30(5) and i5o be defined as 
Lemma 13 and M is the eollection of all local modes. Assume that f has bounded 
third derivative and is a Morse function and that assumption (D) holds. Let f 
be another smooth function. There exists constants c*,co,ci,eo that all depend 
only on f such that when (e, 6) satisfy the following condition 

S < e < Cq, 6 < min{(5o, Haus(IK((5), B(M, y/e))} (47) 


and if 


II/- /lls.max < Co 

11/- /111,max < Cl exp 


4v^||/||2.max||/||c 


(48) 
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Fig 13. Result from Lemma 13: lower bound on minimal gradient. This plot shows possible 
values for minimal gradient r}e{x) (pink regions) when d{x^B) is known. Note that we have 
chosen < ^ 2 . 


then for all x C K(5) 


II lim 7ra;(i) 

i—>-oo 


lim nx{t)\\ < c» 

t—>-oo 




(49) 


Note that condition (47) holds when (e, d) are sufficiently small. 

Proof. The proof of this lemma is closely related to the proof of Theorem 
2 of Arias-Castro et al. (2016). The results in Arias-Castro et al. (2016) is a 
pointwise convergence of gradient flows; now we will generalize their findings to 
the uniform convergence. 

Note that K{d) = S) U B{x, ^/e). For x S B{x, yi), the result is trivial 
when e is sufficiently small. Thus, we assume x € ?i(e,S). 

From equation (40-44) in Arias-Castro et al. (2016) (proof to their Theorem 

2 ), 


II lim 7r^(t) - lim TTx{t)\\ 

t—XX) t—^X 


< 



11/- /||i.„.axev^ll/ll=—+ 211/ - 

V Ct||/||2,max 


(50) 


under condition (48) and e < eo for some constant eg. 

Thus, the key is to bound te{x). Recall that x S 'H{e,d). Now consider the 
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gradient flow and deflne z = T:x{tg{x)). 


fiz) - fix) = f 

Jo 


ds 


ds= g{7Tx{s))'^7T'^{s)ds 

Jo 


rteix) 


(51) 


\\9i'^xis))fds > -f^iS)\ix). 


Since f{z) - f{x) < 2||/||oo, we have 


and by Lemma 13, 


loo > -'ye{S)H^{x) 




7. (<5)2 - 52^2 

for all X G H(e, S). 

Now plug-in (52) into (50), we have 


(52) 


lim TT^it)- lim 7ra;(t)|| < V aoe + ai\\f f\\ l,maxe 

£—foo £—foo * 


Vdll/lh 


8 |l/llc 


0211 /- /lie 
(53) 

for some constants 00 , 01 , 02 - Now using condition (48) to replace the second 
term of right hand side, we conclude 


II lim 7 r^(t) - lim 7 r,^(t)|| < aaJe+Wf-fWl 

i—>-oo t—>-oo V ’ 

for some constant 03 . 

By Lemma 7 in Arias-Castro et al. (2016), there exists some constant C 3 such 


that when 03^6 + 11 / - /ll/max < Vca, 

II lim TT^it) - lim n^it)\\ < V^c^Wf - , 


Thus, when both e and ||/ — /s^maxll sufficiently small, there exists some 
constant c* such that 

II lim TT^it) - lim n^it)\\ < c*||/ - /|| 

t—^oo t—^oo 

for all X G H(e, S). 

□ 


Now we turn to the proof of Theorem 1. 

Proof of Theorem L The proof contains two parts. In the first part, 
we show that when ||/ — /Hs^niax is sufficiently small, we have Haus(i3,i3) < 
■ 2 ||?||’‘° —, where B and B are the boundary of descending d-manifolds for / 
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and /. The second part of the proof is to derive the convergence rate. Because 
Haus(i3,i3) < —) we can apply the second assertion of Lemma 10 to 

derive the rate of convergence. Note that C and C are the critical points for / 
and / and M = Cq, M = Cq are the local modes for / and /. 

Part 1: Haus(i3,i3) < ^ 2 . 11 ^ 1 * 3" —5 the upper bound for Hausdorff dis¬ 
tance. Let (T = min{||a; — y\\ : x,y G M,x /=■ y}. That is, a is the smallest dis¬ 
tance between any pair of distinct local modes. By Lemma 9, when ||/ —/II 3 max 
is small, / and / have the same number of critical points and 

Haus(C,C) < A\\f - /II*,max < ^11/ - /IlS.max, 

where A is a constant that depends only on / (actually, we only need ||/—/II 2 max 
to be small here). 

Thus, whenever ||/ — /Ha^max satisfies 

ll/-/IIW<^> (54) 

every M has an unique corresponding point in M and vice versa. In addition, 
for a pair of local modes : rrij G M, fhj G M, their distance is bounded 

by \\mj-mj\\ < f. 

Now we pick (e,(5) such that they satisfy equation (47). Then when ||/ — 
/II 3 max i® sufficiently small, by Lemma 14, for every x G ’H(e, S) we have 

II lim n^it) - lim 5'^(t)|| < c*i/||/ - /||oo < c»J\\f - f\\l 

>-oo * V ’ 

Thus, whenever 

ll/-/IIW<^(|)'. (55) 

TTxit) and TTx{t) leads to the same pair of modes. Namely, the boundaries B 
will not intersect the region 'H{e,6). And it is obvious that B cannot intersect 
B{M, ^/e). To conclude, 

BCH{e,5)=^ 

B n B{M, Te) = 0 (56) 

^ 5 n K((5) = 0 , 

because by definition, ]K(i5) = 77(6, 6) n B{M, y/e). 

Thus, B C K((5)‘" = 5 0 (5, which implies Haus(i3,i3) < i5 < ^ 2 ||^|| 3 " — (note 
that S < So < ^ 211 ^ 1 !*^° — appears in equation (47) and Lemma 13). 

Part 2: Rate of convergence. To derive the convergence rate, we use proof 
by contradiction. Let q G B,q G B a, pair of points such that their distance 
attains the Hausdorff distance Haus (^B, B^ . Namely, q and q satisfy 

lk-9ll = Haus 
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and either q js the projected point from q onto B or q is the projected point 
from q onto B. 

Recall that V(a;) is the normal space to R at x G R and we define V(x) 
similarly for x € B. An important property of the pair q, q is that q — q G 
Y{q),Y{q). If this is not true, we can slightly perturb q (or q) on B (or B) to 
get a projection distance larger than the Hausdorff distance, which leads to a 
contradiction. 

Now we choose x to be a point between q, q such that x = ^q + ^q. We 
define e(x) = and e(x) = |||~^|| . Then e(x) G ¥((?) and e(x) G V(§) and 

e(x) = —e(x). 

By Lemma 10 (second assertion). 


e{x) = e(x)^g(x) > - x|| > 0 

I{x) = e{xY'g{x) > x\\ > 0. 


(57) 


Thus, for every x between g, q, 

e(x)^g(x) > 0, ,e(x)'^g(x) =-e(x)^g(x) < 0. (58) 

Note that we can apply Lemma 10 to / and its gradient because when ||/ — /II 2 
is sufhciently small, the assumption (D) holds for / as well. 

To get the upper bound of ||g—gl| = Haus(i3, B), note that ||g—x|| = |||g—g]], 
so 

e(x)^g(x) = e(x)^(g(x) - g{x)) + e{xfg(x) 

> e{xfg{x) - II/- /111,max 

> ^i7min|k - a;|| - 11/-/||i,max (By Lemma 10) 

= ^-f^minlk- gll - II/- /lll.max- 

Thus, as long as 


Haus(i3,i3) = ||g 


> 3 


II/-/111,. 

Tlmin 


we have e(x)^g(x) > 0, a contradiction to equation (58). Hence, we conclude 
that _ 

Haus(R,R) < 3 H/- /llbmax ^ ^ ^ ^ 

^min I 

□ 


Proof of Theorem 3. 

To prove the asymptotic rate of the rand index, we assume that for every local 
mode of p, there exists one and only one local mode of Pn that is close to the 
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specific mode of p. By Lemma 9, this is true when ||p„ — pjjg is sufficiently 
small. Thus, after relabeling, the local mode rhi of is an estimator to the 
local mode of p. Let Wi be the basin of attraction to fhi using and Wi 
be the basin of attraction to using Vp. Let Al\B = {x : x € A, x ^ B}[j{x : 
X & B,x ^ A] he the symmetric difference between sets A and B. The regions 



(60) 


are where the two mode clustering disagree with each other. Note that En are 
regions between the two boundaries Bn and B 
Given a pair of points Xi and Xj , 


^ ^n{X,,X,) X, or Xj e En 
By the definition of rand index (30), 


(61) 



(62) 


Thus, if we can bound the ratio of data points within Em we can bound the 
rate of rand index. 

Since K is compact and p has bounded second derivatives, the volume of En 
is bounded by 



(63) 


Note Vol(^) denotes the volume (Lebesgue measure) of a set A. We now con¬ 
struct a region surrounding B such that 


EndB® Haus(B„, B) = 14 


(64) 


and 



(65) 


Now we consider a collection of subsets of K: 


V = {B 0 r : i? > r > 0}, 


( 66 ) 


where i? < oo is the diameter for K. For any set 4 C K, let P{Xi G A) and 
Pn{A) = - € ^) denote the probability of an observation within A 

and the empirical estimate for that probability, respectively. It is easy to see 
that 14 G V for all n and the class V has a finite VC dimension (actually, the 
VC dimension is 1). By the empirical process theory (or so-called VC theory, 
see e.g. Vapnik and Chervonenkis (1971)), 



( 67 ) 
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Thus, 


PiX,&Vn)-PniVn) = Op 



( 68 ) 


Now by equations (61) and (62), 


1 - rand (p„,p) < 8P„(£;„) < 8P„(t4) < 8P{Xi e K) + Or 



Therefore, 



< supp(a;) X Vol(l/„) + Op 



(70) 



which completes the proof. Note that we apply Theorem 2 in the last equality. 


□ 


Proof of Theorem 4. Let (Xi,Fi),-- - ,( Xn , Yn ) be the observed data. 
Let Ei denote the d-cell for the nonparametric pilot regression estimator fhn- 
With Ig = {i : Xi G Eg}, we define as the matrix with rows Xi, i G Ii and 
similarly we define Y^. 

We define to be the matrix similar to X^ except that the row elements 
are those Xi within Ei, the d-cell defined on true regression function m. We 
also define Yo,£ to be the corresponding Yi. 

By the theory of linear regression, the estimated parameters iiijPe have a 
closed form solution: 


ifieJef = {XjXe)-^XjYi. 


(71) 


Similarly, we define 


ilto^e, Po,e)’^ — {^0,1^0,e) 


(72) 


as the estimated coefficients using Xo,f and Yq,^. 

As ||m — is small, by Theorem 3, the number of rows at which 

Xf and Xo_£ differ is bounded by 0{n x ||to„ — to||i max)- This is because an 
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observation (a row vector) that appears only in one of and is those 
fallen within either Eg or Eg but not both. Despite the fact that Theorem 3 
is for basins of attraction (d-descending manifolds) of local modes, it can be 
easily generalized to 0-ascending manifolds of local minima under assumption 
(A). Thus, the similar bound holds for d-cells as well. Thus, we conclude that 


—'Kf'Kg - "Kq g\Q g 

n n ' 

-XjYg - -XlgYo^g 


— rn.||i^niax) 

= rn.||i^niax) 


(73) 


since (X^, Y^) and (Xo,£, Yq,^) only differ by 0{n x ||to„ — m|ji,niax) elements. 
Thus, 


(mo,£ — Jie, /3o.£ — , 


1 , 


1 . 


-X^ ,Xo,^ - -Xi 


— 0(11777-72 777|| i^niax) ; 


which implies. 


max 


{|!mo. 


t - Ml, \\Po,e ■ 


|| = 0(||m„ - TTrlli^max). 


Now by the theory of linear regression, 


max|||/7o,^ - fii\\,\\Po^g - /3dl} = Op 


1 ^ 

-XjYg 

n 


(74) 

(75) 

(76) 


Thus, combining (75) and (76) and use the fact that all the above bounds are 
uniform over each cell, we have proved that the parameters converge at rate 
0 (||to„ - m||i,max) + Op 

For points within the regions where Eg and Eg agree with each other, the rate 
of convergence for parameter estimation translates into the rate of 777„^iviSR ~ 
wmsr- The regions that Eg and Eg disagree to each other, denoted as N„, have 
Lebesgue 0(||m7j —m||i_niax) by Theorem 1. Thus, we have completed the proof. 
□ 


Proof of Theorem 5. The proof of Theorem 5 is nearly identical to the 
proof of Theorem 4. The only difference is that the number of rows that X^ and 
Xo/ differ is bounded by 0{n x ||r77„ — due to the low noise condition 

(36). Thus, equation (73) becomes 


1 -r 

-XjXg 

77 

-XfY^ 

n 




—XJ^Yq 

■n ’ ' 


0(l|Wn- Wlll.max) 
0(l|Wn- Wlll.max) 


( 77 ) 
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SO the parameter estimation error (76) is 0{\\rhn - 
Under assumption (Kl-2) and using Theorem 
kernel regression), 


8 (the same result works for 


0(||m„ 


mill,max) = o(h^) + Op 


[ / log» 
\^V n/i^+2 


// xl/(d+6)\ 

Thus, with the choice that h = O I i 1 1, we have 0(||m„—TO||i,max) = 

//, s2/(d+6)\ 

’ which proves equation (37). 

□ 


Proof of Theorem 6. 

We first derive the explicit form of the parameters within cell Ei. 

Note that the parameters are obtained by (14); 

(^^ 71 ) = argmin / [f {x) - r] -xf dx. 
vn J Et 


Now we define a random variable Ue G that is uniformly distributed over E£. 
Then equation (14) is equivalent to 


(dj,7() = argmin E ((f(E^) - rj - . 

?7,7 ' 


The analytical solution to the above problem is 


f 1 EiUeV \ 7 E(/(C/,)) \ 

E(C/,) E{UtUl) j \ E{Uif{Ui)) j 


(78) 


(79) 


^Now we consider another smooth function / that is close to / such that 
11/ — / II 3 max is Small so we can apply Theorem 1 to obtain consistency for both 
descending d-manifolds and ascending 0-manifolds. Note that by Lemma 9, all 
the critical points are close to each other and after relabeling, each d-cell Ei of 
/ is estimated by another d-cell E(_ of /. Theorem 1 further implies that 


\-eb{E() - L&b{Ee) 


= o 


(II/-/II 1 ,. 


Leb 


(U.AU,) = 0 ( 11 /-/111,max) , 


(80) 


where Leb(yl) is the Lebesgue measure for set A and AAB = {A\B) U {B\A) is 
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the symmetric difference. By simple algebra, equation (80) implies that 
||E(c 7^) - E(C/,)||oo = o ( 11 /- /||l,„,ax) 

\mUeUl) - E{UeUl)\\^ = O (\\f - /||i,„,ax) 

.: / (81) 
|E(/(C/,)) - E(/(C/,))| = O ( 11 / - 

imujim - E(c/,/(c/,))iu = o (li/- . 

By (81) and the analytic solution to (f)J,7]) from (79), we have proved 



Since the bound does not depend on the cell indices i, (82) holds uniformly for 
all £ = 1, • • • , 77. 

□ 
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